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Abstract 



Intractable distributions present a common difficulty in inference within the proba- 
bilistic knowledge representation framework and variational methods have recently been 
popular in providing an approximate solution. In this article, we describe a perturbational 
approach in the form of a cumulant expansion which, to lowest order, recovers the stan- 
dard Kullback-Leibler variational bound. Higher-order terms describe corrections on the 
variational approach without incurring much further computational cost. The relation- 
ship to other perturbational approaches such as TAP is also elucidated. We demonstrate 
the method on a particular class of undirected graphical models, Boltzmann machines, 
for which our simulation results confirm improved accuracy and enhanced stability during 
learning. 

1. Introduction 

In recent years, interest has steadily grown over many associated fields in representing and 
processing information in a probabilistic framework. Arguably, the reason for this is that 
prior knowledge of a problem domain is rarely absolute and the probabilistic framework 
therefore arises naturally as a candidate for dealing with this uncertainty. Better known 
examples of models are Belief Networks (Pearl, 1988) and probabilistic neural networks 
(Bishop, 1995; MacKay, 1995). However, incorporating many prior beliefs can make models 
of domains so ambitious that dealing with them in a probabilistic framework is intractable, 
and some form of approximation is inevitable. One well-known approximate technique is 
Monte Carlo sampling (see e.g., Neal, 1993; Gilks, Richardson, & Spiegelhalter, 1996), 
which has the benefit of universal applicability. However, not only can sampling be pro- 
hibitively slow, but also the lack of a suitable convergence criterion can lead to low confidence 
in the results. Recently, variational methods have provided a popular alternative, since they 
not only approximate but can also bound quantities of interest giving, potentially, greater 
confidence in the results (Jordan, Gharamani, Jaakola, Sz Saul, 1998; Barber &: Wiegerinck, 
1999; Barber & Bishop, 1998). The accuracy of variational methods, however, is limited by 
the flexibility of the variational distribution employed. Whilst, in principle, the flexibility, 
and therefore, the accuracy of the model, can be increased at will (see e.g., Jaakkola &: Jor- 
dan, 1998; Lawrence, Bishop, & Jordan, 1998), this generally is at the expense of incurring 
even greater computational difficulties and optimization problems. 
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In this article, we describe an alternative approach which is a perturbation around stan- 
dard variational methods. This has the property of potentially improving the performance 
at only a small extra computational cost. Since most quantities of interest are derivable 
from knowledge of the normalizing constant of a distribution, we present an approximation 
of this quantity which, to lowest order, recovers the standard Kullback-Leibler variational 
bound. Higher-order corrections in this expansion are expected to improve on the accuracy 
of the variational solution, despite the loss of a strict bound. 

In section (2) we will briefly discuss why the normalizing constant of probability distri- 
butions is of such importance. We briefly review the Kullback-Leibler variational bound in 
section (3) before introducing the perturbational approach in section (4). This approach 
is illustrated on a well-known class of undirected graphical models, Boltzmann machines, 
in section (5), in which we also explain the relation of this approach to other perturba- 
tional methods such as TAP (Thouless, Anderson, Sz Palmer, 1977; Kappen Sz Rodriguez, 
1998a). We conclude in section (6) with a discussion of the advantages and drawbacks of 
this approach compared to better known techniques. 



2. Normalizing Constants and Generating Functions 

We consider a family of probability distributions Q over a vector of random variables 
s = {si\i 6 [1, . . . ,N]}, parameterized by w, 1 

exp(H(s;wj) 

Q(S; ™ )= Z(w) ■ (1) 
With a slight abuse of notation, we write the normalizing constant as 

Z(w) = Jds exp(fT(a;to)) . (2) 

The 'potential' H(s; w) thus uniquely determines the distribution Q. We have assumed here 
that the random variable s is continuous — if not, the corresponding integral (see equation 
2) should be replaced by a summation over all possible discrete states. 

One approach in statistics to obtain marginals and other quantities of interest is given 
by considering generating functions (Grimmett Sz Stirzaker, 1992), which take the form 

G(X;w) = J ds exp (log Q(s; w) + X-s) (3) 

Averages of variables with respect to Q are given by derivatives of G(X; w) with respect to 
A for fixed to, so that, for example, (S1S2) = d 2 G(X; w)/ d\\d\2, evaluated at A = 0. We 
can equally well consider the function 

Z(X;w) = J dsexp(H(s;w) + X-s) (4) 

so that 

1 d 2 Z(X 1 ,X 2 ;w 



(sis 2 ) 



Z(Ai,A 2 ;w) <9Ai<9A 2 



(5) 

Ai,A 2 =0 



1. We assume that the distribution is strictly positive over the domain. 
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Formally, all quantities of interest can be calculated from "normalizing constants" of dis- 
tributions, possibly modified in a suitable manner, such as in (4) above. It is clear that 
all moments of Q can be generated by differentiating (4) in a similar manner, and can be 
as equally easily obtained from the generating function approach, as from the normalising 
constant approach. We prefer here the normalising constant approach since this enables us 
more easily to make contact with results from statistical physics. The normalising constant 
approach is also more natural for the examples of undirected networks that we shall consider 
in later sections. 

In the following sections, we layout how to approximate the normalising constant, as- 
suming that any additional terms in the normalising constant, of the form As in equation 
(4), are absorbed into the definition of the potential H. 

As we will see in section (5.2.3), the normalising constant also plays a central role 
in learning. Unfortunately, for many probability distributions that arise in applications, 
calculation of the normalizing constant is intractable due to the high dimensionality of the 
random variable s. When s is a iV-dimensional binary vector, naively at least, a summation 
over 2 N states has to be performed to calculate the normalizing constant. 

However, sometimes it is possible to exploit the structure of the distribution in or- 
der to reduce the computational expense. A trivial example is that of factorised models, 
Q(s) oc rii*( s i); i n which Z = Jds = Hi I dsi ^ (s^) , so that the computation 

scales only linearly with N. Similarly, in the case of a Gaussian distribution, computa- 
tion scales (roughly) with TV 3 . These tractable distributions have recently been exploited 
in the variational method to approximate intractable distributions, see for example (Saul, 
Jaakkola, Sz Jordan, 1996; Jaakkola, 1997; Barber Sz Wiegerinck, 1999; Jordan et al., 1998). 
In the following section we will briefly review one of the most common variational methods 
which exploits the Kullback-Leibler bound. 



3. The Kullback-Leibler Variational Bound 

Our aim in this section is to briefly describe the current state-of-the-art in approximating the 
normalizing constant of an intractable probability distribution. We denote the intractable 
distribution of interest as Q\ to distinguish it from an auxiliary distribution Qo ? we will 
introduce. Without loss of generality, we write the distribution over a vector of random 
variables s, parameterized by a vector w as 

e i?i(S;lfl) 

Qi(s;w) = — , (6) 

Zi{w) 

where the potential H\ (s; w) is a known function of s and w, e.g., in the case of a factorised 
model this would have the form of H\ = J2k w kSk- The corresponding (assumed intractable) 
normalizing constant is given by 



Zi 



(w) = jdse H ^ s ^ . (7) 



We will use a family of tractable distributions, Qo? parameterized by 6 which we write as 

,H {S;0 



Qo(s;0)= with Z (0)= dse H °^ (8) 

Z (O) J 
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where, by assumption, the normalizing constant Zq is tractably computable. 

Standard variational methods attempt to find the best approximating distribution Qo(s; 9*) 
that matches Q\{s\w) by minimizing the Kullback-Leibler divergence, 

KL(Q , Qi) =[ds Q (s; 9) log ^ G) . (9) 
J Qi(s;w) 

Since, using Jensen's inequality, KL(Qo, Qi) > 0, we immediately obtain the lower bound 
logZi > - J dsQ {s; 9) log Q {s; 9) + J ds Q {s; 9)H 1 {s;w) . (10) 

Using the definition (8) we obtain the bound in the more intuitive form, 

logZi > log Z + jdsQ (s;0) {H^s; w) - H {s; 6)) (11) 

The lower bound is made as tight as possible by maximizing the right-hand side with respect 
to the parameters 9 of Qo, which corresponds to minimising the Kullback-Leibler divergence 
(9). This approach has recently received attention in the artificial intelligence community 
and has been demonstrated to be a useful technique (Saul et al., 1996; Jaakkola, 1997; 
Barber & Wiegerinck, 1999). Note that whilst this lower bound is useful in providing an 
objective comparison of two approximations to the normalising constant (the approximation 
with the higher bound value is preferred), it does not translate directly into a bound on 
conditional probabilities. An upper bound on the normalising constant is also required in 
this case. Whilst, in some cases, this may be feasible, in general this tends to be a rather 
more difficult task, and we restrict our attention to the lower bound (Jaakkola, 1997). 

In the next section, we describe another approach that enables us to exploit further such 
tractable distributions. 

4. The Variational Cumulant Expansion 

In order to extend the Kullback-Leibler variational lower bound with a view to improving its 
accuracy without much greater computational expense, we introduce the following family 
of probability distributions Q a , parameterized by a, w and 9: 

H a (S;W,9) 

Q a (s;w,9)= (12) 
Z a (w,G) 

where the potential is given by 

H a {s;w,G) = aH 1 (s;w) + (l-a)H (s;9) . (13) 

The probability distributions in this family interpolate between a tractable distribution, Qo 
which is obtained for a = 0, and an intractable distribution, Q\ which is obtained for a = 1. 
For intermediate values of a 6 (0, 1), the distributions remain intractable. The normalizing 
constant of a distribution from this family is given by 

logZ a (tM) = logJdse Ho ( s ^M^™)- H °(s-,9)) 

= logZo + log(e< H ^ s ^- H °( s &)) o , (14) 
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where (.} denotes expectation with respect to the tractable distribution Qq. The final term 
in (14) is intractable for any and we shall therefore develop a Taylor series expansion 

for it around a = 0. That is, 

2 

+ y (H L (s; w) - Ho(s) - (H^s; w)) + {H (s)) ) 2 + 0(a 3 ) (15) 

In terms of the random variable AH = Hi(s; w) — Ho(s; 6), we see that (15) contains the 
first terms in a power series in AH, the first being the mean, and the second the variance 
of AH. More formally, we recognize the final term in (14) as the cumulant generating 
function K^fj(a) = \og(e aAH ^^ with respect to the random variable AH (Grimmett & 

Stirzaker, 1992). If the logarithm of the moment generating function ^e aAff ^ is finite 

in the neighborhood of the origin then the cumulant generating function K&H(a) has a 
convergent Taylor series, 

OO -I 

K AH {a) = -X n (AH)a n , (16) 

n=l 

where k^(AH) is the nth cumulant of AH with respect to the distribution Qq. (The 
definition of the nth cumulant is k® (AH) = Jp^ log (e aAH ) a =o)- For convenience we have 



no longer denoted the explicit dependence on the parameters 6 and w. 

Since the intractable normalizing constant Z\ corresponds to setting a = 1 in equation 
(14), we can write the Taylor series as 

log Z 1 = log Z + £ -k° n (AH) + ^—y k f +1 (AH) , (17) 

where the remainder follows from the the Mean Value Theorem (see, for example Spivak 
(1967)), and kf +1 (AH) is the (I + l)th cumulant of AH with respect to Q 6 < £ < 1. An 
approximation is obtained by simply neglecting this (intractable) remainder and using the 
truncated expansion: 

' 1 

log Z l « log Z + -,k° n (^H) , (18) 
, n! 

n—l 

Note that the right-hand side of equation (18) depends on the free parameters 6, and we shall 
discuss later how these parameters can be set to improve the accuracy of the approximation. 

In the following two subsections, we will look in more detail at the first and second-order 
approximation of this Taylor series, since these illuminate the variational method and the 
difficulties in retaining a lower bound. 

4.1 First Order and the Variational Bound 

The Taylor series (equation (17)) up to first order yields 

log Z l = log Z + (AH) Q + ^4(AH) , (19) 
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where the remainder k\{AH) is equal to the variance (the second cumulant) of AH with 
respect to Q 5 with < £ < 1, i.e., &f = var^Aif) = ^AH 2 ^ - (AH)^ 2 , and (.} ? denotes 

expectation with respect to the distribution Q^. Since the variance is nonnegative for any 
value of £, we have 

logZi >logZ + (AH) , (20) 

so that log Z\ is not only approximated but also bounded from below by the right hand side 
of equation (20). This bound is the standard mean field bound (see also equation (10)), 
and is employed in many variational methods (see, for example Saul et al., 1996; Jaakkola, 
1997; Barber & Wiegerinck, 1999). This bound can then be made as tight as possible by 
optimizing with respect to the free parameters 6 of the tractable distribution Qq. 

4.2 Second and Higher Order 

To second order equation (17) yields 

log Z l = log Z + (AH) + ^vav (AH) + ifcf (AfT) , (21) 

with the remainder &f = (AH 3 ^- 3^AH 2 ^(AH}^+ 2(Ai/)jj and < f < 1. Since this 

remainder cannot be bounded in a tractable manner for general distributions Qo an d Qi, 
we have no obvious criterion to prefer one set of parameters 6 of the tractable distribution 
over another. Similarly, for higher-order expansions it is unclear, within this expansion, 
how to obtain a tractable bound and, consequently, how to select a set of parameters 6. 
We stress that any criterion is essentially a heuristic since the error made by the resulting 
approximation is, by assumption, not tractably computable. Whilst, in principle, a whole 
range of criteria could be considered in this framework, we mention here only two. The 
first, the bound criterion, is arguably the simplest and most natural choice. The second 
criterion we briefly mention is the independence method, which we discuss mainly for its 
relation to the TAP method of statistical physics (Thouless et al., 1977). 

4.2.1 Bound Criterion 

The first criterion assigns the free parameters 6 to maximize the (first-order) bound, equa- 
tion (20). The resulting distribution Qq(s;6*) is used in the truncated approximation, 
equation (18). To second order this is explicitly given by 

log Z x « log Zl + (AH)* + ^var*(Atf ) , (22) 

where a '*' denotes that the distribution Qq(s;6*) is to be used. Note that the computa- 
tional cost involved over that of the variational method of section (4.1) is small. No further 
optimization is required, only the calculation of the second cumulant var^Ai?) which by 
assumption is tractable. 



440 



Variational Cumulant Expansions for Intractable Distributions 



4.2.2 Independence Criterion 

The second criterion we consider is based on the fact that the exact value of Z\ is indepen- 
dent of the free parameters 6, i.e., 

Since log Z\ is intractable, we substitute the truncated cumulant expansion into equation 
(23), 



_d_ 

del 



(logZo + E^OAff)) = 0. (24) 

\ n=l ' / 



In general, there are many solutions resulting from the independence criterion and alone it 
is too weak to provide reliable solutions. Under certain restrictions, however, this approach 
is closely related to the TAP approach of statistical mechanics (Thouless et al., 1977; Plefka, 
1982; Kappen & Rodriguez, 1998a, 1998b), as will also be described in section (5.1.2). 



5. Boltzmann Machines 

To illustrate the foregoing theory, we will consider a class of discrete undirected graphical 
models, Boltzmann machines (Ackley, Hinton, & Sejnowski, 1985). These have application 
in artificial intelligence as stochastic connectionist models (Jordan et al., 1998), in image 
restoration (Geman & Geman, 1984), and in statistical physics (Itzykson & Drouffe, 1989). 
The potential of a Boltzmann machine, with binary random variables Sj 6 {0, 1}, is given 
by 

H(s;w) = Y^ w i s i + \ Yl Wi 3 SiS 3 ( 25 ) 

with Wij = Wji and wa = 0. Unfortunately, Boltzmann machines are in general intractable 
since calculation of the normalizing constant involves a summation over an exponential 
number of states. In an early effort to overcome this intractability, Peterson and Anderson 
(1987) proposed the variational approximation using a factorised model as a fast alterna- 
tive to stochastic sampling. However, Galland (1993) pointed out that this approach often 
fails since it inadequately captures second order statistics of the intractable distribution. 
Using the potentially more accurate TAP approximation did not overcome these difficulties 
and often lead to unstable solutions (Galland, 1993). Furthermore, it is not clear how to 
extend the TAP approach to deal with using more complex, non-factorised approximating 
distributions. We examine the relationship of our approach to the TAP method in sec- 
tion (5.1.2). Another approach which aims to improve the accuracy of the correlations, 
though not the normalising constant, is linear response (Parisi, 1988) which was applied to 
Boltzmann machines by Kappen and Rodriguez (1998a) for the case of using a factorised 
model, see section (5.1). The approach we take here is a little different in that we wish 
to find a better approximation primarily to the normalising constant. Approaches such as 
linear response can then be applied to this approximation to derive approximations for the 
correlations, if desired. 
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More flexible approximating distributions have also been considered in the variational 
approach, for example, mixtures of factorised distributions (Jaakkola Sz Jordan, 1998; 
Lawrence et al., 1998), and decimatable structures (Saul & Jordan, 1994; Barber & Wiegerinck, 
1999). We show how to combine the use of such more flexible approximating distributions 
with higher-order corrections to the variational procedure. For clarity and simplicity, we 
will initially approximate the normalizing constant of a general intractable Boltzmann ma- 
chine Qi, see equation (25), using only a factorised Boltzmann machine as our tractable 
model Qq. Subsequently, in our simulations, we will use a more flexible tractable model, a 
decimatable Boltzmann machine. 

5.1 Using Factorised Boltzmann Machines 

The potential of a factorised Boltzmann machine is given by 

H o = J2 i s i- (26) 

i 

For this simple model, calculating higher-order cumulants is straightforward once the value 
of the free parameters 6 are known since, 

(s h s i2 ...s k ) = <s il ) <s i2 ) • • • (s k ) with h / i 2 / . (27) 

For notational convenience, we define mi to be the expectation value of Sj, 

rrn = (si) = (l + exp(-^))" 1 . (28) 



5.1.1 Bound Criterion 



Using a factorised distribution to approximate the intractable normalizing constant corre- 
sponding to equation (25) gives, from equation (20), the variational bound 



log Zi>S (m) + ^2 w i m i + WijUiiUij , 



h3 



where, for convenience, we have defined the entropy 

S(m) = - ^2 l {m i \ogm i + (1 -mj)log(l - mi)} . 



(29) 



(30) 



One can either maximize the bound, i.e., the right-hand side of equation (29), directly, or 
use the fact that at the maximum, 



d 
30 k 



S (m) + W i m i + 2 w ij m i m j 



0, 



(31) 



which leads to the fixed point equation (with the understanding that the means m are 
related to the parameters by equation (28)) 



Qi = m + w ij m i 



(32) 
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The well-known "mean field" equations (Parisi, 1988) are a re-expression of (32) in terms 
of the means only, given by (28). The optimal parameters of the factorised model might 
therefore also be obtained by iterating, either synchronously or asynchronously (Peterson 
Sz Anderson, 1987), these fixed point equations. Note that equation (31) also holds for 
minima and saddle-points, so in general additional checks are needed to assert that the 
parameters correspond to a maximum of (29). Insertion of the fixed point solution into the 
second-order expansion for the bound criterion, equation (22), yields (up to second order) 

logZi pa S(m) Y^ w i m i^ Yl Wi i mim j\ X^i/ 771 ^ 1 ~ mi)mj(l - rrij), (33) 

i i,j i,j 

where we emphasize that the rrii are given by the variational bound solution, i.e., equation 
(32). 

5.1.2 Independence Criterion 

Using a factorised distribution to approximate the intractable normalizing constant, see 
equation (25), gives up to second order 

logZi ps S(m) + Y^mm + ^^mjmrnj + 2^Wij 2 mi (1 - mi)rrij (1 - rrij) 

i i,j 




Wi. 



^Wijirij] mj(l-mj) . (34) 



The independence criterion leads to the equations 
/ / \ 2 \ 



^2wij 2 rrij (1 - rrij) + ( 0* - Wi -^w^rrij ] 0- - m^\ 
V i V i J J 



^2 ( 6j - Wj - Wj k m k Wijirij (1 - rrij) . (35) 
j \ k J 



In general, there are many solutions to these equations and the search can be limited by 
inserting in equation (34) the constraint that the second-order solution is close to the first- 
order solution, i.e., 6i = wi + Y^j w ij m j + 0(w 2 ), and by neglecting terms of 0(w 4 ) and 
higher. This simplifies equation (34) to — Ftap, the negative TAP free energy 2 (Thouless 
et al., 1977; Plefka, 1982; Kappen Sz Rodriguez, 1998a). The independence criterion, i.e., 
dF^Ap/dO = 0, leads now to the fixed point condition 

0i = Wi + Y^ Wijrrij + i ^2 Wij 2 (l - 2m,j)mj{\ - rrij) , (36) 
i i 

These TAP equations can have poor convergence properties and often produce a poor 
solution (Bray & Moore, 1979; Nemoto & Takayama, 1985; Galland, 1993). The additional 



2. In Thouless et al. (1977), Plefka (1982) and Kappen and Rodriguez (1998a) the random variables si 6 
{-!,!}■ 



443 



Barber & van de Laar 




Figure 1: The Boltzmann machine with the random variable si can be decimated to a 
Boltzmann machine without s\. 



constraint that the TAP solution should correspond to a minimum of Ftap improves the 
solution. However, since TAP is therefore essentially an expansion around the first-order 
solution, we expect the numerical difference between the bound criterion and convergent 
TAP results to be small. For these reasons, we prefer the straightforward bound criterion 
and leave other criteria for separate study. 

5.2 Numerical Results 

In this section we present results of computer simulations to validate our approach. In 
section (5.2.1) we will compare the different methods of section (4.1) and section (4.2) on 
approximating the normalizing constant Z. A similar comparison is made for approximating 
the correlations in section (5.2.2). In section (5.2.3) we show the results of a learning problem 
in which a Boltzmann machine is used to learn a set of sample patterns, a typical machine 
learning problem. In all cases, the numbers of random variables in the Boltzmann machines 
are chosen to be small to facilitate comparisons with the exact results. 

In some simulations we will not only use factorised Boltzmann machines but also more 
flexible models that can make the variational bound (see equation (20)) tighter, namely 
decimatable Boltzmann machines (Saul & Jordan, 1994; Ruger, 1997; Barber & Wiegerinck, 
1999). Decimation is a technique that eliminates random variables so that the normalizing 
constant for the distribution on the remaining variables remains unchanged up to a constant 
known factor. For completeness, we briefly describe this technique in the current context. 

Suppose we have a Boltzmann machine with many random variables, for which a three- 
variable subgraph is depicted in Figure 1(a), so that random variable si is connected only 
to random variables S2 and S3. Mathematically, the condition for invariance of Z after 
removing random variable s\, Figure 1(b), becomes 

gfl' +0' 2 S2+0' 3 S3+0' 23 S2S 3 _ ^2 e 0+0lS 1 +02S2+03S3+012SlS 2 +013SlS3+023S2S3 (37) 
S\si S 
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o 

o o 



o 



o 



o o 
o 

(a) factorised model 





(b) decimatable 



(c) fully connected 



Figure 2: Boltzmann machines of eight random variables 



where we include the constant 9 for convenience. Invariance is fulfilled if 



6' 2 = 6 2 + log 



1 + e' 



1 + e"i 



8' 3 = 3 + log 



1+e' 



i+0i 



1 + e"i 



(l + eM (1 + 6*1+012+013) 
^ = ^ + log \ 1 + e ,/ + ,\ a)(1 + ^ 1+ , ia / and 0' = 9 + log (l + e *) . 



(38) 



For decimatable Boltzmann machines, one can repeat the decimation process until, fi- 
nally, one ends up with a Boltzmann machine which consists of a single random variable 
whose normalizing constant is trivial to compute. This means that for decimatable Boltz- 
mann machines the normalizing constant can be computed in time linear in the number 
of variables. Decimation in undirected models is similar in spirit to variable elimination 
schemes in graphical models. For example, in directed belief networks, "bucket elimina- 
tion" enables variables to be eliminated by passing compensating messages to other buckets 
(collections of nodes) in such that the desired marginal on the remaining nodes remains the 
same (Dechter, 1999). 

5.2.1 Approximating the Normalizing Constant 

As our 'intractable' distribution Qi, we took a fully connected, eight-node Boltzmann ma- 
chine, which is not decimatable (see Figure 2(c)). As our tractable Boltzmann machine, Qo, 
we took both the factorised model of Figure 2(a) and the decimatable model of Figure 2(b). 
We then tried to approximate the normalizing constant for the fully connected machine 
in which the parameters were randomly drawn from a zero-mean, Gaussian distribution 
with unit variance. See the appendix for additional theoretical and experimental details 
regarding the optimization scheme used. The relative error of the approximation, 



E 



\ogZ 



exact 



log Z a PP rox 



log Z exact 



(39) 



was then determined for both the first and second-order approach using both the factorised 
and decimatable model. This was repeated 550 times for different random drawings of 
the parameters of the fully connected Boltzmann machine. The resulting histograms are 
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depicted in Figure 3. In these histograms one can clearly see that the bound, while present 
in the first-order approach, is lost in the second-order approach since some normalising 
constant estimates were above the true value, violating the lower bound. In five out of 
1100 cases (two times 550) the synchronous mean field iterations did not converge and 
these outliers were neglected in the plots. It is, however, possible to make the number of 
non- convergent cases even lower by using asynchronous instead of synchronous updates of 
the mean field parameters (Peterson & Anderson, 1987; Ansari, Hou, & Yu, 1995; Saul 
et al., 1996). Note that, in both cases, the second-order method improves on average on 
the standard (first order) variational results. Indeed, using only a factorised model with 
the second-order correction improves on the standard variational decimatable result. 

We also determined whether the second-order approach improves the accuracy of the 
prediction in each individual run. In order to determine this we used the paired difference, 

log Z 6rst 



\ogZ 



exact 



log Z exact 



\ogZ 



exact 



logZ 



second 



log Z exact 



(40) 



If A is positive the second order improves the first order approach. The corresponding 
results of our computer simulations are depicted in Figure 4. In all runs A was positive, 
corresponding to a consistent improvement in accuracy. 



5.2.2 Approximating the Means and Correlations 

There are many techniques that can be called upon to approximate moments and correla- 
tions. Primarily, the cumulant expansion, as we have described it, is intended to improve 
the accuracy in estimating the normalising constant, and not directly the moments of the 
distribution. Arguably the simplest way to approximate correlations is to use the stan- 
dard variational approach to find the best approximating distribution to Q\ and then to 
approximate the moments of Q\ by the moments of Qq. In this approach, however, certain 
correlations that are not present in the structure of the approximating distribution Qo, 
will be trivially approximated. For example, approximating (siSj)g i using the factorised 
distribution gives (siS^Q^ « ( s i)Q ( s j)q - We wm examine in this section some ways of 
improving the estimation of the correlations. 

One procedure that can be used to approximate correlations such as (siSj)g 1 is given 
by the so-called linear response theory (Parisi, 1988). This approach uses the relationship 
(5) in which the intractable normalising constant is replaced with its approximation (29). 

The approach that we adopt here, due to it's straightforward relationship to normalising 
constants, is given by considering the following relationship: 

(*i*i) Ql = Qi(« = Mi = 1) = e ^+^^M (41) 

where is the normalising constant of the original network in which nodes i and j have 

been removed and the biases are transformed to h' k = hk+Jik+Jjk- This means that we need 
to evaluate 0(n 2 ) normalising constants to approximate the correlations. To approximate 
the normalising constants, we use the higher-order expansion (33). Similarly, we attempt 
to obtain a more accurate approximation to the means (sj), based on the identity, 

(s i ) Ql =Qi(s i = l) = e h ^ (42) 
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(a) Factorised first order. The 
mean (absolute) error is 0.036 
(0.034 without outliers). 



(b) Factorised second order. The 
mean absolute error is 0.0079 
(0.0074 without outliers). 



Decimatable First Order 



Decimatable Second Order 




(c) Decimatable first order. The 
mean (absolute) error is 0.0186 
(0.0181 without outlier). 



(d) Decimatable second order. 
The mean absolute error is 
0.0031 (0.0027 without outlier). 



Figure 3: Approximation of the normalizing constant of a fully connected, eight-node Boltz- 
mann machine (see Figure 2(c)), whose parameters are randomly drawn from a 
zero-mean, Gaussian distribution with unit variance. A factorised model (see 
Figure 2(a)) is used as the approximating distribution in (a) and (b) and a deci- 
matable model (see Figure 2(b)) is used in (c) and (d). The histograms are based 
on 550 different random drawings of the parameters. For the readability of the 
histogram, we have excluded five outliers (four in (a) and (b), one in (c) and (d)), 
for which the synchronous mean field iteration did not converge. The mean error 
is the mean of the absolute value of equation (39). 



where is the normalising constant of the original network in which node i has been 

removed and the biases are transformed to h' k = hu + Jik- The "intractable" Boltzmann 
machine in our simulations has 8 fully connected nodes with biases and weights drawn 
from a standard zero-mean, unit-variance Gaussian distribution. In figures 5 and 6 we plot 
results for approximating the means (sj) and correlations (sjSj) respectively. As can be seen, 
the approach we take to approximate the correlations is an improvement on the standard 
variational factorised model. We emphasize that, since we are primarily concerned with 
approximating the normalising constant, there may be more suitable methods dedicated 
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(a) Factorised model. 



(b) Decimatable model. 



Figure 4: Difference between first and second-order approximation of the normalizing con- 
stant of a fully connected, eight-node Boltzmann machine using (a) a factorised 
and (b) a decimatable model. The histograms are based on the same 550 ran- 
dom drawings as used in Figure 3. All 1100 differences, including the five non- 
convergent runs, were larger than zero. Again for readability of the histogram, the 
non-convergent runs are not included. The mean difference was 0.0281 (0.0269 
without the non-convergent runs) and 0.0155 (0.0154 without the non- convergent 
run) for the factorised model and the decimatable model, respectively. 



to the task of approximating the correlations themselves. Indeed, one of the drawbacks of 
these perturbative approaches is that physical constraints, such as that moments should be 
bounded between and 1, can be violated. 



5.2.3 Learning 

The results of the previous section show that, for randomly chosen connection matrices 
J, the higher-order terms can significantly improve the approximation to the normalising 
constant of the distribution. However, in a learning scenario, such results may be of little 
interest since the connection matrix will become intimately related to the patterns to be 
learned — that is, J will take on a form that is appropriate for storing the patterns in the 
training set as learning progresses. 

We are therefore interested here in training networks on a set of visible patterns. That 
is, we split the nodes of the network into a set of visible Sy and a set of hidden units Sh with 
the aim of learning a set of patterns S v . . . S v with a Boltzmann machine Qi(Sy, Sh)- By 
using the KL divergence between an approximating distribution Qo{Sh\Sv) ° n the hidden 
units and the conditional distribution Qi(Sh\Sv) we obtain the following bound 



InQi(Sy) > 



/ Qo(S H 



S v )\nQ (S H \S v ) + 



J Qo(S H \ 



S v ) \nQi(S H , S v ) 



(43) 



For the case of Boltzmann machines, In Qi(Sh, Sy) = H(Sh,Sv) — InZ in which InZ is 
(assumed) intractable. In order to obtain an approximation to this quantity, we therefore 
introduce a further variational distribution, Qo(Sy, Sh) (not the same as Qo(Sh \Sy)) which 
can be used as in (10) to obtain a lower bound on InZ. Unfortunately, used in (43), this 
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(a) The mean absolute error in (b) The mean absolute error in 
approximating the means (si) is approximating the means (si) is 
0.0350. 0.0129. 



Figure 5: Approximating the means (sj) for 1000 randomly chosen 8 node fully connected 
Boltzmann machines. The weights were drawn from the standard normal distri- 
bution, (a) The standard variational approach in which the means are estimated 
by the means of the best approximating factorised distribution, (b) Using the ra- 
tio of normalising constants (41), in which second-order corrections are included 
in approximating the normalising constants. Using the ratio of normalising con- 
stants without the higher order corrections gave a mean error of 0.046, slightly 
worse than the standard result variational result. 



Factorized First Order estimates of correlations 



Factorized Second Order estimates of correlations 




(a) Mean absolute error = 0.0329 (b) Mean absolute error = 0.0103 



Figure 6: Approximating the correlations (sjSj) for 1000 randomly chosen 8 node fully 
connected Boltzmann machines. The weights were drawn from the standard 
normal distribution, (a) The standard variational approach in which the means 
are estimated by the means of the best approximating factorised distribution, and 
(b) using the ratio of normalising constants, including second-order corrections. 
Without second-order corrections, this gives a mean error of 0.0428. 
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lower bound on \nZ does not give a lower bound on the likelihood of the visible units 
Qi(Sv)- Nevertheless, we may hope that the approximation to the likelihood gradient is 
sufficiently accurate such that ascent of the likelihood can be made. Taking the derivative 
of (43) with respect to the parameters of the Boltzmann machine Qi(Sh, Sy), we arrive at 
the following learning rule for gradient ascent given a pattern Sv'- 

AJ = 7? (( s i s j)Q (S H \S v ) ~ ( SiS ^Qi(S H ,S v )) ( 44 ) 

where rj is a chosen learning rate. The correlations in the first term of (44) are straightfor- 
ward to compute in the case of using a factorised distribution Qo- The "free" expectations 
in the second term are more troublesome and need to be approximated in some manner. 
We examine using the standard factorised model approximations for the free correlations 
and more accurate "higher-order" approximations as described in section (5.2.2). Here, 
we are primarily concerned with monitoring the likelihood bound (43) under such gradient 
dynamics, so we will look at the exact bound value, it's approximation using a factorised 
model for \nZ (29), and it's second-order correction (33). 

We consider a small learning problem with 3 hidden units and 4 visible units. Ten visible 
training patterns were formed in which the elements were chosen to be on (value=l) with 
probability 0.4. In figure 7(a) we demonstrate variational learning in which the free statis- 
tics, required for the likelihood gradient ascent rule (44), are approximated with a simple 
factorised assumption, ( s i s j}Q 1 (^s H s v ) ~ m i m j(* j)- We display the time evolution of 
the exact training pattern likelihood bound 3 (43) (solid line), and two approximations to it: 
the first approximates the intractable In Z term by using the standard variational approach 
with a factorised model (dashed line). The second approximation uses the second-order 
correction to the factorised model value for InZ (dot-dash line). The learning rate was 
fixed at 0.05. In monitoring the progress of learning, the higher-order correction to the like- 
lihood bound can be seen to be a more accurate approximation of the likelihood compared 
to that given by the use of the standard variational approximation alone. Interestingly, 
using the second-order correction to the likelihood, a maximum in the likelihood is detected 
at a point close to saturation of the exact bound. This is not the case using the first-order 
approximation alone and, with the standard approximation to the likelihood, the dynamics 
of the learning process, in terms of the training set likelihood, are poorly monitored. 

In figure 7(b) the gradient dynamics are provided again by equation (44) but now 
with the free correlations { s i s j) q 1 (s h s v ) approximated by the more accurate ratio-of- 
normalising-constants method, as described in section (5.2.2). Again, we see an improve- 
ment in the accuracy of monitoring the progress of the learning dynamics by the simple 
inclusion of the higher-order term and, again, the higher-order approximation displays a 
maximum at roughly the correct position. The likelihood is higher than in figure 7(a) since 
the gradient of the likelihood bound is more accurately approximated through the more 
accurate correlation estimates. Note that the reason that the exact likelihood lower bound 
can be above 1 is due to pattern repetitions in the training set. The instabilities in learning 
after approximately 120 updates are due to the emergence of multimodality in the learned 
distribution Qi(Sh, Sv \ J), indicating that a more powerful approximation method, able 

3. In practice, the likelihood on a test set is more appropriate. However, the small number of possible 
patterns here (16) makes it rather difficult to form a representative test set. 
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(a) Learning using dynamics in 
which the free correlations are 
approximated using a factorised 
model. 



(b) Learning using dynamics in 
which the free correlations are 
given by the ratio of normalising 
constants approach. 



Figure 7: Learning a set of 10 random, 4-visible-unit patterns with a Boltzmann machine 
with 3 hidden units. The solid line is the exact value for the total likelihood bound 
(43) of all the patterns in the training set. The dashed line is the likelihood bound 
approximation based on using a factorised model to approximate In Z, equation 
(29). The dash-dot line uses the second order correction to InZ, equation (33). 



to capture such multimodal effects, should be used to obtain further improvements in the 
likelihood. 



6. Discussion 

In this article we have described perturbational approximations of intractable probability 
distributions. The approximations are based on a Taylor series using a family of prob- 
ability distributions that interpolate between the intractable distribution and a tractable 
approximation thereto. These approximations can be seen as an extension of the variational 
method, although they no longer bound the quantities of interest. We have illustrated our 
approach both theoretically and by computer simulations for Boltzmann machines. These 
simulations showed that the approximation can be improved beyond the variational bound 
by including higher-order terms of the corresponding cumulant expansion. 

Simulations showed that the accuracy in monitoring the training set likelihood during 
the learning process can be improved by including higher order corrections. However, 
these perturbational approximations cannot be expected to consistently improve on zeroth 
order (variational) solutions. For instance, if the distribution is strongly multimodal, then 
using a unimodal variational distribution cannot be expected to be improved much by the 
inclusion of higher-order perturbational terms. On the other hand, higher-order corrections 
to multimodal approximations may well improve the solution considerably. 
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We elucidated the relationship of our approach to TAP, arguing that our approach is 
expected to offer a more stable solution. Furthermore, the application of our approach to 
models other than the Boltzmann machine is transparent. Indeed, these techniques are 
readily applicable to other variational methods in a variety of contexts both for discrete 
and continuous systems (Barber & Bishop, 1998; Wiegerinck & Barber, 1998). 

One drawback of the perturbational approach that we have described is that known 
"physical" constraints, for example that moments must be bounded in a certain range, 
are not necessarily adhered to. It would be useful to develop a perturbation method that 
ensures that solutions are at least physical. 

We hope to apply these methods to variational techniques other than the standard 
Kullback-Leibler bound, and also to study the evaluation of other suitable criteria for the 
setting of the variational parameters. 

Acknowledgments 

We would like to thank Tom Heskes, Bert Kappen, Martijn Leisink, Peter Sollich and Wim 
Wiegerinck for stimulating and helpful discussions and the referees for excellent comments 
on an earlier version of this paper. 

Appendix A. 

In this appendix we describe how to optimize the variational bound when the normalization 
constant of an intractable Boltzmann machine is approximated using another tractable 
Boltzmann machine. For convenience, we denote the potential of an intractable Boltzmann 
Machine as 

(45) 

i 

where the "extended" parameters are given by 9i E {(9j, 6ij\i E [1, . . . , N], j = + ... , N]} 
and sj G {sj, SiSj\i E [1, . . . , N],j = [i + 1, . . . , N]}. The potential of the other tractable 
Boltzmann Machine is denoted as 

H = Oisi , (46) 

where denotes the set of parameters of the tractable Boltzmann Machine. 

We want to optimize the bound of equation (20) with respect to the free parameters 8j 
(J E i.e., 

— [logZ + <Atf) ] = 0, (47) 
which leads to 

(AHsj) -(AH) ( Sj ) = (48) 
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and thus to the fixed point equation 4 

9i=J2 J2i F ^~ 1 ]u F JK^K , (49) 
Je* K 

where the Fisher matrix is given by Fjj = (sisj} — (si) (sj} which depends on the free 
parameters 0, and Fyy is the Fisher matrix of the tractable Boltzmann machine. For a 
factorised approximating model, is a diagonal matrix and equation (49) simplifies to 
equation (32). For a decimatable Boltzmann machine, the elements of the Fisher matrix are 
tractable. The iteration can be performed either synchronously or asynchronously (Peterson 
Sz Anderson, 1987). We prefer here the synchronous case since for all N parameter-updates 
we only need to calculate and invert a single Fisher matrix instead of N matrices in the 
asynchronous case. In most applications, however, the asynchronous method seems to be 
advantageous with respect to convergence (Peterson Sz Anderson, 1987; Ansari et al., 1995; 
Saul et al., 1996). 
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